Synthesizing theories of human language with Bayesian program induction

Automated, data-driven construction and evaluation of scientific models and theories is a long-standing challenge in artificial intelligence. We present a framework for algorithmically synthesizing models of a basic part of human language: morpho-phonology, the system that builds word forms from sounds. We integrate Bayesian inference with program synthesis and representations inspired by linguistic theory and cognitive models of learning and discovery. Across 70 datasets from 58 diverse languages, our system synthesizes human-interpretable models for core aspects of each language's morpho-phonology, sometimes approaching models posited by human linguists. Joint inference across all 70 datasets automatically synthesizes a meta-model encoding interpretable cross-language typological tendencies. Finally, the same algorithm captures few-shot learning dynamics, acquiring new morphophonological rules from just one or a few examples. These results suggest routes to more powerful machine-enabled discovery of interpretable models in linguistics and other scientific domains.

Surface and underlying forms. The final pronounced form of a word is referred to as its surface form, conventionally written between square brackets, while intermediate forms such as entries in the lexicon are called underlying forms and are written between slashes. For example, the English past tense of the word walk (i.e., "walked") has the surface form [wɔkt] but the underlying form /wɔkd/, which is built from the underlying form of the past tense marker (i.e., /d/) and the underlying form of the stem for walk (i.e., /wɔk/). In the main text we adopted a simplified presentation in which all (sequences of) phonemes are written between slashes.
Morphosyntax. In modern linguistic theories, the mapping between form and meaning is assumed to be mediated by a central component called (morpho)syntax, which contains information about the category of constituents and how they combine. Thus, each linguistic constituent is described by a form-category-meaning triple ⟨f, c, m⟩. In the present work, these components consist of:

1. Form: A specification of the sound structure of the constituent. In this work, we use sequences of phonemes described as phonetic feature vectors. For example, the English (en) past tense verb form walked is represented as [wɔkt]. This complex constituent consists of two smaller pieces: the past tense suffix, which is represented underlyingly as /d/, and the stem /wɔk/.

2. Morphosyntactic Category: A specification of the category of the constituent and how it combines with other constituents. The category of walked is verb V. The past tense marker /d/ has category V\V, meaning that it must attach to a verb stem on its left to produce a verb. Analogously, a morpheme that is prefixed (e.g., 're-', as in 'reanalyze', 'redo', ...) would have the morphosyntactic category V/V. In the main text, we used the more intuitive, less formal markers pfx, stem, and sfx for V/V, V, and V\V, respectively.

3. Meaning: A specification of the meaning. In this work, we assume meanings are simply sets of atomic meaning features. So the meaning of walked is [stem:walk; tense:past], the meaning of the past tense marker /d/ is [tense:past], and the meaning of the stem walk is [stem:walk].
Thus, walked will be represented as ⟨[wɔkt], V, [stem:walk; tense:past]⟩. A grammar G can be thought of as a joint distribution over the set X of form-category-meaning triples ⟨f, c, m⟩ ∈ X, and is specified by four components G = ⟨L, S, ⟦·⟧_P, ⟦·⟧_S⟩.¹ First, there is a stored collection of primitive or atomic units known as the lexicon L. Each lexical item is itself a form-category-meaning triple. Lexical items are assembled into more complex structures using an inventory of structure-building operations S. In our model, S corresponds to concatenation of morphemes. Finally, assembled morphosyntactic structures are mapped to surface sound and meaning representations by a pair of functions called interface mappings, which take syntactic objects to sound structure (the phonological interface ⟦·⟧_P) and meaning structure (the semantic interface ⟦·⟧_S), respectively.
For example, in English, the stem for the verb walk is ⟨/wɔk/, V, [stem:walk]⟩ ∈ L_EN. In our model, we adopt a single structure-building operation σ(·, ·) that operates on pairs of constituents, concatenating their form parts, algebraically canceling their category parts, and performing unification on their meanings. For example, in English the past tense is marked regularly by the lexical item ⟨/d/, V\V, [tense:past]⟩, so the past tense form walked can be constructed as

σ(⟨/wɔk/, V, [stem:walk]⟩, ⟨/d/, V\V, [tense:past]⟩) = ⟨/wɔkd/, V, [stem:walk; tense:past]⟩.

Note that the output of σ represents the underlying form /wɔkd/; on the surface, the final /d/ is devoiced in the context of the voiceless /k/, that is, it is pronounced [wɔkt]. The set of all possible underlying forms is simply the closure of the lexicon L under σ. We write U_L for the set of all underlying forms in a language with lexicon L, that is, the set of all structures derivable from L using σ.
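The following minimal sketch illustrates σ on form-category-meaning triples; the encoding (strings for forms and categories, dictionaries for meanings) is an assumption for illustration, not our actual implementation:

```python
# Minimal sketch of the structure-building operation sigma (assumed encoding).
# Triples are (form, category, meaning); categories are strings such as
# "V" (stem), "V\\V" (suffix), and "V/V" (prefix).

def sigma(left, right):
    """Concatenate forms, cancel categories, and unify meanings."""
    (f1, c1, m1), (f2, c2, m2) = left, right
    if c2 == c1 + "\\" + c1:      # right item is a suffix seeking c1 on its left
        category = c1
    elif c1 == c2 + "/" + c2:     # left item is a prefix seeking c2 on its right
        category = c2
    else:
        raise ValueError("categories do not cancel")
    return (f1 + f2, category, {**m1, **m2})   # unify the meaning feature sets

walk = ("wɔk", "V", {"stem": "walk"})
past = ("d", "V\\V", {"tense": "past"})
print(sigma(walk, past))
# ('wɔkd', 'V', {'stem': 'walk', 'tense': 'past'})  -- the underlying form
```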
The phonological interface function ⟦·⟧_P is a set of ordered (SPE-style) transduction rules which map underlying forms to surface forms. In the present work, we leave the semantic interface ⟦·⟧_S as the identity function.
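A minimal sketch of applying such an ordered rule set, assuming a simple regex encoding of rules rather than feature matrices (the devoicing rule below is illustrative, not one synthesized by the system):

```python
import re

# Minimal sketch of an ordered-rule interface function: apply each SPE-style
# rewrite in a fixed order to the underlying form. Real rules operate over
# feature matrices; here a rule is just a (pattern, replacement) regex pair.

def interface_P(underlying, rules):
    surface = underlying
    for pattern, replacement in rules:    # earlier rules feed later ones
        surface = re.sub(pattern, replacement, surface)
    return surface

# Illustrative rule: /d/ devoices to [t] immediately after a voiceless segment.
devoicing = (r"(?<=[ptkfsʃ])d", "t")

print(interface_P("wɔkd", [devoicing]))   # -> wɔkt
```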
Using the formalism introduced above, the theory-induction objective corresponds to finding the set of phonological transduction rules ⟦·⟧_P, the semantic transduction rule ⟦·⟧_S, and the lexicon L maximizing the posterior probability of the grammar given the observed triples X,

P(⟦·⟧_P, ⟦·⟧_S, L | X) ∝ P(X | ⟦·⟧_P, ⟦·⟧_S, L) · P(⟦·⟧_P, ⟦·⟧_S, L).

S2.3 Learned Fragment-Grammar meta-model
Fig. S5 shows our basic context-free grammar over SPE-style rewrites, which defines a space of possible programs for modelling phonological rules. In learning a meta-model, we build on top of this hand-coded basis. Specifically, we learn by adding more context-free production rules to this basic grammar. This works by learning additional commonly occurring fragments of rules (Fig. S6), which biases the model toward reusing those fragments.

¹ In the main paper we factor grammars into rules T and lexicon L. This section combines these two factors into a single tuple, the grammar G, and also incorporates morphosyntactic category information in the observed data, which we elided from the main text. Our system assumes every observed (form, meaning) pair has the same morphosyntactic category.

Figure S1: Given a dataset (highlighted in orange), the system jointly infers both language-specific phonological rules (the "theory" box, labeled r1, r2, etc.) and a dataset-specific lexicon, which includes both stems and affixes for each inflection. Together the theory and lexicon explain the orange data via a derivation in which the morphology output (prefix+stem+suffix) is transformed according to the ordered rules. The symbol ϵ means the empty string. Panels show Lumasaaba, with lexicon ⟨/iŋ/, pfx, A⟩, ⟨ϵ, sfx, A⟩, ⟨/çele/, stem, frog⟩, ⟨/xa/, pfx, big⟩, ⟨ϵ, sfx, big⟩, ⟨/tali/, stem, beer⟩, ... and derivation "a frog" → /iŋ/+/çele/+ϵ → /iŋçele/; and Turkish, with lexicon ⟨ϵ, pfx, gen⟩, ⟨/un/, sfx, gen⟩, ⟨/tʃan/, stem, bell⟩, ⟨ϵ, pfx, NomPl⟩, ⟨/köj/, stem, village⟩, ⟨/lar/, sfx, NomPl⟩, ⟨/el/, stem, hand⟩, ... and derivation "bell (gen)" → ϵ+/tʃan/+/un/ → /tʃanun/.

Figure S2: As in Fig. S1, but for Hungarian: given a dataset highlighted in orange, the system jointly infers both language-specific phonological rules (the "theory" box) and a dataset-specific lexicon, which together explain the orange data via derivations in which the morphology output (prefix+stem+suffix) is transformed according to the ordered rules. The symbol ϵ means the empty string.

Figure S3: Example failure modes for our system; the illustration is analogous to Figs. S1-S2. The Somali rule system fails to explain 20% of the textbook problem, and many of the individual rules are implausible upon inspection, such as the first copying rule r1; see the bottom of the derivation of the singular form of "mule." Other rules are essentially correct, such as the spirantization process implemented by rule r2 or the neutralization process in r6. The Yowlumne rule system contains a redundant vowel-rounding process, r4, which acts in concert with r3 to repeatedly harmonize segments to /u/; see the illustration of the derivation of the non-future form of "drink" (bottom of the observed data/unobserved derivation box). ("Yowlumne" was formerly known as "Yawelmani" [1].)

Figure S4: Learning curves for artificial grammar learning (compare with Fig. 6). The x-axis of each plot varies the number of training words, each drawn from the 'consistent' grammar. The y-axis of each plot compares the likelihood of test words from the consistent and inconsistent grammars according to the log odds ratio log P(consistent|train)/P(inconsistent|train). Values greater than 0 indicate successful discrimination between the consistent and inconsistent grammars. Curves show mean and standard deviation over n = 15 random test-word pairs conforming to the (in)consistent grammars. Green: with syllabic representation. Red: without syllabic representation.
Source data are provided as a Source Data file.
Excerpt from the grammar (the figure shows the full template for a single rule, including the option of a −feature appended to a matrix):

Phoneme ::= @ | a | g | ···        (a constant phoneme)
Feature ::= voice | nasal | coronal | ···        (a phonological feature)
Z ::= −2 | −1 | 1 | 2        (copying target, an integer index; see caption)
α ::= −2 | −1 | 1 | 2        (place-copy target, an integer index)

Figure S5: Context-free grammar generating phonological rules used by our system. Non-terminal symbols begin with a capital letter, as do Z and α. For increased tractability, we arbitrarily bound the size of each FeatureMatrix to at most three features and, as outlined in the above grammar over SPE rules, each trigger may have at most two feature matrices; hence the maximum range of the copying targets (Z/α) is ±2. Copying targets are expressed as integers indexing into the triggering environments. For example, the rule V → Vᵢ / Vᵢ CC is expressed using the above grammar as V → −1 / V CC, while the rule V → Vᵢ / C C₀ Vᵢ would be expressed as V → 2 / C C₀ V.
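To illustrate how this bound keeps the program space finite, the following sketch (an assumed encoding, not the representation used by our Sketch backend) enumerates all feature matrices with at most three features over a toy feature inventory:

```python
import itertools

# Minimal sketch: enumerate all bounded feature matrices. The real feature
# inventory is much larger; three features are used here for illustration.

FEATURES = ["voice", "nasal", "coronal"]
POLARITIES = ["+", "-"]

def feature_matrices(max_features=3):
    """Enumerate all feature matrices with at most `max_features` entries."""
    for k in range(max_features + 1):
        for feats in itertools.combinations(FEATURES, k):
            for pols in itertools.product(POLARITIES, repeat=k):
                yield tuple(zip(pols, feats))

matrices = list(feature_matrices())
print(len(matrices))      # 27 = 1 + 6 + 12 + 8 with this toy inventory
print(matrices[5])        # (('+', 'coronal'),)
```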

S3.1 Input data format
With the exception of allophony problems, each textbook problem consists of a matrix of surface forms, where the columns correspond to different inflections and the rows correspond to different stems. Matrix entries can be empty (unspecified), either due to missing data or due to a particular inflection not applying to a particular stem (for example, there is no past tense form of a stem like "pineapple", because it is a noun and not a verb). Slightly overloading notation from the main manuscript, we refer to this matrix as X, and we index the rows of X (lexemes) and the columns of X (inflections). For example, Figure S7A shows a paradigm matrix X for a basic problem from Russian: there are 2 inflections (nominative and genitive), corresponding to the columns of the matrix, and 4 stems, corresponding to the rows. Allophony problems instead consist of a set of surface forms along with a set of pairs of phonemes, known as 'allophones', which we treat as a substitution on phonemes. Figure S7B shows an allophony problem from Mohawk.
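For concreteness, the following sketch shows one way such a paradigm matrix could be encoded (an assumed encoding; the surface forms are illustrative, following Russian final devoicing of the underlying stems /glaz/ and /porog/ listed in §S3.4):

```python
# Minimal sketch of the paradigm-matrix input (assumed encoding). Rows are
# stems (lexemes), columns are inflections; None marks an empty cell.

INFLECTIONS = ["nominative", "genitive"]

X = [
    # nominative  genitive
    ["glas",      "glaza"],     # underlying stem /glaz/ 'eye'
    ["porok",     "poroga"],    # underlying stem /porog/ 'threshold'
    [None,        "vagona"],    # an empty cell: missing data
]

def observed_forms(X, inflection_index):
    """All observed surface forms for one inflection, skipping empty cells."""
    return [row[inflection_index] for row in X if row[inflection_index] is not None]

print(observed_forms(X, 0))    # ['glas', 'porok']
```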

S3.2 Counter-Example Guided Inductive Synthesis
We adapted counter-example guided inductive synthesis (CEGIS: [2]) to our setting ('CEGIS' in Figure 5) by maintaining (1) a 'current' theory and (2) a 'covered' set of rows of X for which there exists a stem such that the current theory is consistent with that row; and then repeatedly (3) searching for the next counterexample, i.e., a row of X which is inconsistent with the current theory, (4) using the Sketch program synthesizer to update the current theory to the MAP estimate of the theory that explains both this counterexample and the covered examples, and (5) adding this most recent example to the 'covered' set. Because Sketch requires a finite program space, we must also bound the number of rules searched over during step (4). We allow Sketch to search over at most K + 1 rules, where K is the number of rules in the most recent current theory.
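The following sketch summarizes this loop; `synthesize` stands in for the call to the Sketch solver and `explains` for the row-consistency check, both assumed interfaces rather than real Sketch APIs:

```python
# Minimal sketch of our CEGIS adaptation (assumed interfaces, not real Sketch
# APIs): `synthesize` returns the MAP theory explaining the given rows, and
# `explains` checks whether some stem makes a row consistent with a theory.

def cegis(X, synthesize, explains):
    theory = synthesize(covered=[], counterexample=None, max_rules=0)  # T0: no rules
    covered = []                                  # (2) rows already accounted for
    while True:
        # (3) search for the next counterexample: a row the theory cannot explain
        counterexample = next(
            (row for row in X if row not in covered and not explains(theory, row)),
            None)
        if counterexample is None:
            return theory                         # every row of X is covered
        # (4) MAP update over at most K + 1 rules, K = |current theory|
        theory = synthesize(covered=covered, counterexample=counterexample,
                            max_rules=len(theory.rules) + 1)
        covered.append(counterexample)            # (5) grow the covered set
```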

S3.3 Incremental synthesis of grammars
Our incremental approach to synthesizing phonological grammars combines and generalizes counterexample-guided synthesis with test-driven synthesis [3]. Similar to our CEGIS algorithm, we maintain a 'current' theory as well as a set of 'covered' examples, and repeatedly search for a (non-covered) counterexample to the current theory. However, rather than re-solving from scratch for a new theory accommodating both the counterexample and the covered examples, we search only over those theories close in edit distance to the most recent theory.
We define the edit distance, d(T1, T2), between a pair of theories T1, T2 by counting the number of insertions, deletions, substitutions, and swaps that separate the sequences of rules for T1 and T2. Any modification to a rule counts as a complete substitution, so entire rules are resynthesized wholesale rather than, e.g., having individual feature matrices 'edited'. This coarse-grained notion of edit distance has the advantage that it can be easily encoded in a SAT solver, and we hypothesize it may be less prone to getting trapped in local optima because it encourages larger search moves.
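This distance is a Damerau-Levenshtein distance over whole rules; a minimal sketch, treating rules as opaque comparable values, is:

```python
# Minimal sketch of the theory edit distance: restricted Damerau-Levenshtein
# over whole rules (insertions, deletions, substitutions, adjacent swaps).

def theory_edit_distance(rules1, rules2):
    m, n = len(rules1), len(rules2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if rules1[i - 1] == rules2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a rule
                          d[i][j - 1] + 1,         # insert a rule
                          d[i - 1][j - 1] + cost)  # substitute / match
            if (i > 1 and j > 1 and rules1[i - 1] == rules2[j - 2]
                    and rules1[i - 2] == rules2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap adjacent rules
    return d[m][n]

# Reordering two rules costs a single swap:
print(theory_edit_distance(["r1", "r2"], ["r2", "r1"]))   # 1
```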
For each counterexample, we progressively increment the maximum edit distance until Sketch discovers a satisfying solution. We furthermore 'minibatch' counterexamples, grouped by lexeme (row of X) and ordered according to the textbook problem. We automatically set the number of lexemes in each batch depending on the number of inflections (columns of X), such that each minibatch contains no more than nine surface forms (i.e., with 3 inflections, each minibatch comprises 3 lexemes; with 4 inflections, each minibatch comprises 2 lexemes). We conjecture that larger batch sizes generally lead to better convergence, because they expose the SAT solver to more data at once, which on balance should lead to less myopic search moves. Yet larger batch sizes increase compute requirements, both because the size of the SAT problem grows linearly with batch size and because the search radius may need to grow larger with increased batch size. Accordingly, for allophony alternation problems, we batch the entire problem at once, because these problems are much easier. Our selection of a minibatch size of 9 was motivated by informal pilot experiments suggesting that after around 9 new words the solver performance degraded severely; due to the high compute cost of running these simulations, we did not perform a systematic hyperparameter sweep, and the 'optimal' batch size may differ from the one used.

As a concrete example, consider a dataset of English verb inflections in the infinitive and third-person singular. Suppose the batch size is two. If the first paradigm row is [mit] and [mits] (the pronunciations of the words "meet" and "meets"), then these are the first two words that the system considers. Initially, T0 contains no rules, so these words serve as a counterexample, because the third-person-singular morphology, namely that a suffix must be appended (and that in the lexicon this suffix is recorded as /s/), has not yet been inferred. Running Sketch on this example updates the grammar to contain the third-person-singular suffix /s/ and introduces no new phonological rules. Suppose that the next paradigm row is [it] and [its] (the pronunciations of the words "eat" and "eats"). The system will find that the current grammar, when supplemented with the stem /it/, explains this example, so it does not serve as a counterexample: the morphophonology inferred from the first batch is consistent with it. Suppose that the next row of the paradigm is [nid] and [nidz] (the pronunciations of the words "need" and "needs"). There is no stem which can explain this paradigm row given the current affixes and rules. Therefore it is a counterexample, and in the next iteration the system will accommodate it by introducing a phonological rule that explains the alternation between /s/ and /z/.

Parallelism. A naive implementation of this approach would encode the edit-distance constraint directly into the Sketch system. A more efficient, parallelizable approach is to enumerate a finite set of theory templates, where each template corresponds to a family of edits to the original theory. A theory template is a list in which each element is either a fixed rule or a new unknown rule, denoted ??, that the Sketch solver will solve for; here we follow the convention, from Sketch, of writing ?? for unknown parts of the program that need to be synthesized. We write Templates(T, D) for the set of templates compatible with the current rules {r_k} (k = 1..K) of theory T that lie within edit distance D of T. With this subroutine in hand we can define a constraint, C_Template, upon theories, datasets, and templates:

C_Template(T, X, τ) ⟺ T is an instantiation of template τ, and C(T, X),

where C imposes the constraint that T explains X (Eq. 4 of main text). This constraint can be passed to Sketch. We can now rewrite the incremental synthesis update equations from Eqs. 6-7 of the main text as

D_t = min { D : there exist T and τ ∈ Templates(T_t, D) such that C_Template(T, X, τ) }
T_{t+1} = argmax_T F(X, T) subject to C_Template(T, X, τ) for some τ ∈ Templates(T_t, D_t).

This refactoring of our algorithm exposes an opportunity for parallelism: given a specific template drawn from Templates(T_t, D), we can straightforwardly solve the above equations using Sketch, so we simply loop over each such template and allocate a parallel worker to it. An outer loop monotonically increases the search distance D, and a parallel inner loop checks whether any D-distance template satisfies C_Template. When we find a D with such templates, we return the best satisfying T under F(X, ·). We used 40 CPUs in our experiments. Unfortunately, our method does not saturate the parallel compute resources of many-core machines until the number of rules grows large, so we observe very little speed-up on easier problems: the number of distinct templates is smaller when there are fewer rules. For example, editing a theory with a single rule r1 yields the templates {r'1 = ??} and {r'1 = ??, r'2 = ??}, so we have only two parallel jobs (other templates, such as {r'1 = r1, r'2 = ??}, are subsumed by the latter template).
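The sketch below illustrates template enumeration under stated simplifications: kept rules stay in their original order, all holes are appended at the end, and subsumed templates are not pruned. `??` marks a hole for the solver; dropping a rule and adding a hole can pair up as a single substitution.

```python
from itertools import combinations

HOLE = "??"   # a hole for the Sketch solver to fill with a synthesized rule

def templates(rules, D):
    """Yield rule-list templates within edit distance D of `rules`
    (simplified: ordered subsets of kept rules, plus trailing holes)."""
    K = len(rules)
    seen = set()
    for keep in range(K, max(K - D, 0) - 1, -1):
        dropped = K - keep
        for idx in combinations(range(K), keep):
            base = tuple(rules[i] for i in idx)
            for holes in range(D + 1):
                if max(dropped, holes) > D:       # edit distance exceeds budget
                    continue
                t = base + (HOLE,) * holes
                if t not in seen:
                    seen.add(t)
                    yield t

print(list(templates(["r1"], D=1)))
# [('r1',), ('r1', '??'), (), ('??',)]
```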

S3.4 Bayesian prior over grammars
As a baseline UG, we simply count the number of symbols present in the lexicon and rules; we heuristically penalize insertions, deletions, and constant phonemes by counting them as two symbols each. For the rules synthesized for the Russian problem of Fig. S7A, the total theory cost is 5, so the theory prior is P(T) ∝ exp(−5). This is defined up to a constant of proportionality. The normalizing constant is guaranteed to be well-defined because Sketch only considers finite program spaces, hence a finite number of possible rule sequences; and because we only care about the relative probabilities of candidate grammars, we can ignore the normalizing constant. In solving Fig. S7A the system also constructs a lexicon, which participates in the prior probability calculation. This lexicon contains affixes for the nominative and genitive,

⟨ϵ, pfx, nom⟩, ⟨ϵ, sfx, nom⟩, ⟨ϵ, pfx, gen⟩, ⟨/a/, sfx, gen⟩,

where ϵ is the empty string. The total length of all affixes is 0 + 0 + 0 + 1 = 1, contributing a factor of e^−1 to P(L). The latent stems also contribute factors to P(L), because they are members of L:

L ⊃ {⟨/vagon/, stem, wagon⟩, ⟨/avtomobilʲ/, stem, car⟩, ⟨/vetʃer/, stem, evening⟩, ⟨/muʒ/, stem, husband⟩, ⟨/karandaʃ/, stem, pencil⟩, ⟨/glaz/, stem, eye⟩, ⟨/porog/, stem, threshold⟩}

The total length of the listed stems is 5 + 9 + 5 + 3 + 8 + 4 + 5 = 39, which contributes an additional factor of e^−39 to P(L). If the Russian problem comprised only the data in Fig. S7A, the prior probability of the lexicon would be P(L) ∝ e^−39 · e^−1, the prior probability of the theory would be P(T) ∝ e^−5, and the total prior probability of the grammar would be P(L, T) ∝ e^−39 · e^−1 · e^−5.
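A minimal sketch of this description-length prior over lexica (symbol counting over plain-ASCII forms; the doubled insertion/deletion/constant-phoneme penalties for rules are omitted for brevity):

```python
import math

# Minimal sketch of the description-length prior: log P(L) is, up to the
# ignorable normalizing constant, minus the total number of symbols in L.

def lexicon_log_prior(lexicon):
    """log P(L) + const, where each lexical item is (form, category, meaning)."""
    return -sum(len(form) for form, category, meaning in lexicon)

affixes = [("", "pfx", "nom"), ("", "sfx", "nom"),
           ("", "pfx", "gen"), ("a", "sfx", "gen")]
stems = [("vagon", "stem", "wagon"), ("glaz", "stem", "eye")]

print(lexicon_log_prior(affixes + stems))     # -(0+0+0+1 + 5+4) = -10
print(math.exp(lexicon_log_prior(affixes)))   # e**-1: the affix factor
```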

S3.5 Ablation studies
We studied several ablations of our system; see Fig. 5. We found that basic representational concerns matter most: one needs the right rule representation, which we think of as being part of universal grammar.
We studied the effect of changing the feature system, as well as the effect of ablating two key computational mechanisms (having features at all, and having Kleene star).
• We first changed the feature system from so-called 'articulatory' features to 'phonetic' features. Typically, introductory phonology courses start by introducing phonetic features (features of sounds); later, one typically learns that these features can be expressed more concisely and more generically in terms of features of the motor actions required to produce those sounds ('articulatory' features).

• For a more drastic demonstration of the centrality of basic computational mechanisms, we further ablated all phonological features. The system can still express rules in terms of specific phonemes, but cannot generalize and analogize across phonemes. We also removed Kleene star, which means the system can no longer express rules whose triggers abstract over the number of times a phoneme occurs. Recall that this is notated with a subscript 0; thus, all of our example rules with this subscript are inexpressible under this ablation. In principle, this ablation can still learn rules whose behavior is identical to the correct rules, simply by memorizing every phoneme to which a rule applies (due to the ablation of features) and every sequence length for which a rule applies (due to the ablation of Kleene star). In practice, the system no longer has the inductive bias to learn such generalizations; furthermore, search becomes harder because the programs become much longer due to the need to memorize many specific cases.

S4 Full set of problems and model outputs
Allophony problems are given as a set of surface forms along with a set of pairs of phonemes. The goal of the student (as well as the goal of the model) is to recover the rule(s) which predict which element of each pair is the underlying form. Model outputs for alternation problems are of the form:

    Surface form            UR
    a given surface form    model's predicted underlying form
    ···                     ···

i.e., each row reads "The surface form [a phoneme] is underlyingly [phoneme]", followed by a sequence of rules output by the model. Non-allophony problems are given as a matrix of surface forms, where the columns range over different inflections and the rows range over different stems. Missing data are notated with a dash (−). We show the model's predicted concatenative morphology in the first row of each such matrix; in the penultimate column of each matrix we show the predicted underlying stems, and in the final column we show the ground-truth annotations. After each such matrix we show the rules output by the model. For example, the first matrix below illustrates a problem with two inflections, where the input to the model is the data colored purple, from which it synthesizes the morphology and stems shown in salmon, which should be compared with the ground-truth annotations in bold.

Manual rule grading: We evaluate the system's predicted rules on 15 randomly sampled problems, which are flagged with the text 'This problem was manually graded'. Notes from the grading process are provided for these problems. These notes specify the phonological processes (i.e., rules) that are deemed 'ground truth', which rules from the system correspond to those ground-truth processes, and which rules from the system are not part of the ground-truth solution (i.e., spurious rules).
Bukusu (in metatheory training set): This problem was manually graded. The data are pathological in that the only place where these C's are not intervocalic (V_V) is word-initial, and it is cheaper to write 'non-initial'. Implemented by rules 1 and 2.
No spurious rules.
No spurious rules.
No spurious rules. Model outputs no spurious rules.

Russian (fragment of the model-output matrix; predicted stems in the penultimate column, ground-truth stems in the final column):

    pastuxa   pastusʲi   −  −  −  −  −    pastux   pastux
    mʲnʲux    mʲnʲuxa    mʲnʲusʲi  −  −  −  −  −    mʲnʲux   mnʲux
    pluG      pluGa      ···

Rules: